1 Last minut comments/edits

Note: Als je een error krijgt bij het installeren van de packeges met de melding permission denied, kan je Rstudio opnieuw opstarten als administator: (Rechter muisknop -> Als administrator uitvoeren)

Als je git geinstalleerd hebt kan je de github direct clonen vanuit Rstudio; In Rstudio kunnen jullie de github loaden via projects => Version control => Git => https://github.com/AMCMC/OiBMW_Microbiome

Alternative is downloaden als zip;

ALs je deze rmd file opent in Rstudio kun je de losse “chuncks” runnen.

Some systems require Rtools to be installed (windows); Please installed if required.

2 Assignment

Most of the workflow is fixed and code can be run as is. However, a few options are left for the students to modify.

These and questions are indicated in bold. Plz adress these questions

You can modify this rmd document to adress the questions.

Plz highlight your answers using:

<p style=“color:red”> Your answer !</p>

Your answer !

Alternatively you can generate a pdf export form a word document if preferred. Documents can be uploaded in the module on Canvas before April 24th.

Note: It is not required to exactly read the code, but rather get a grasp of how the data is being processed and how this might impact downstream results

3 Introduction

To determine the microbiome composition of a particular environment amplicon sequencing is the most commonly used approach. Specific primers targeting universal genes are used to amplify these taxonomic marker sequences. The obtained amplified sequences (amplicons) can than be sequenced using various platforms. The main task when analyzing these datasets is to determine which are true biological sequences and which are technical noise.

Fig 1. Single step 16S rRNA gene amplicon Illumina library preparation

Fig 1. Single step 16S rRNA gene amplicon Illumina library preparation

During this practical we will analyze some 16S rRNA amplicon sequences obtained from a mock sample. Amplicons were generated using universal primers targeting the V3V4 regions of the 16S rRNA gene and were sequenced on an Illumina MiSeq platform (2x251) using V3 chemistry. The sample used is a mock sample. This sample was artificially created using DNA from in total 55 unique bacterial strains (Table 1 - MC3). Since we know the true composition of the sample we can use it benchmark our wetlab protocols and bioinformatic pipelines.

During this workshop will apply DADA2 to infer the taxonomic composition of the data and compare that to the expected composition. DADA2 is a software package specifically developed to infer the true amplicon seqeuence variants (ASV) of a particular sample. Part of this workshop was based on the tutorial from the dada2 page.

Other packages used: For visualization we use ggplot2. To merge and organize microbiome data we use phyloseq DECIPHER and phangorn are used to generate a phylogenetic tree.

4 Configure R environment and get data

First we install and load the necessary packages.

.cran_packages <- c("ggplot2", "reshape2", "stringr","phangorn","DT","ape","ggrepel","knitr","Hmisc")
.bioc_packages <- c("dada2", "phyloseq","ShortRead","DECIPHER","ggtree")
.inst <- .cran_packages %in% installed.packages()
if(any(!.inst)) {
   install.packages(.cran_packages[!.inst])
}

# R version check!
.inst.bc <- .bioc_packages %in% installed.packages()
if(any(!.inst)) {
   source("http://bioconductor.org/biocLite.R")
   biocLite(.bioc_packages[!.inst], ask = F)
}

.inst.bc <- .bioc_packages %in% installed.packages()
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.12")
BiocManager::install(.bioc_packages[!.inst.bc])

sapply(c(.cran_packages, .bioc_packages), require, character.only = TRUE, quietly = T)
##   ggplot2  reshape2   stringr  phangorn        DT       ape   ggrepel     knitr 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE 
##     Hmisc     dada2  phyloseq ShortRead  DECIPHER    ggtree 
##      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
theme_set(theme_bw())

5 Raw fastq preprocessing

The initial part of the DADA2 workflow consists of preprocessing of the raw fastq files. Data quality is assessed, and several pipeline parameters can be tweaked for optimal analysis.

DADA2 paired-end read handling is somewhat quircky, the forward and reverse reads are processed independantly and only in the end are the merged into a single representative sequence.

5.1 Plot Read quality

First have a look at the size and quality of the data.

FQF <- "./Raw_data/L3_MOCK1.F.fastq.gz"
FQR <- "./Raw_data/L3_MOCK1.R.fastq.gz"
plotQualityProfile(c(FQF,FQR))

So we see the sample has 74333 paired end reads. Furthermore, the quality of the reverse read is poorer than that of the forward read.

5.2 Expected error rate

A bit more intuitive way to look at the quality of the reads, is to look at the expected error rate. The following code reads the phred quality score from the fastq files and calculates the overall expected error for each of the reads. We will only take 1000 reads and plot the cumulative error rate along the read. The quality score is the log transformed likelihood that a particular call was erroneous.

# get the expected error rates and generate the data.frame
cumelative_error_F <- apply(10^(-as.matrix(PhredQuality(quality(readFastq(FQF)[1:1000])))/10),1,cumsum)
cumelative_error_R <- apply(10^(-as.matrix(PhredQuality(quality(readFastq(FQR)[1:1000])))/10),1,cumsum)

# convert the data into long format (required for ggplot2)
cumelative_error_F_long <- data.frame(reshape2::melt(cumelative_error_F), read="Forward")
cumelative_error_R_long <- data.frame(reshape2::melt(cumelative_error_R), read="Reverse")
df.long <- rbind(cumelative_error_F_long, cumelative_error_R_long)
#df.long$Var1 <- factor(ceiling(df.long$Var1/10)*10) # bin the cycles into ba

ggplot(df.long, aes(x=Var1, y=value)) + 
  labs(y="Cumulative expected error rate", x="Cycle number") + 
  facet_wrap(~read) + 
  geom_smooth(stat = 'summary', color = 'red', fill = 'red', alpha = 0.2, 
                fun.data = median_hilow, fun.args = list(conf.int = 0.9)) + 
  geom_smooth(stat = 'summary', color = 'blue', fill = 'blue', alpha = 0.2, 
                fun.data = median_hilow, fun.args = list(conf.int = 0.50))

This shows that very few sequences are actually without error, and we can expect, on average, 2 errors in the forward read and 3 in the reverse read.

5.3 Read duplication

If we assume that most errors are random, we can reason that these errors will results in unique, singleton sequences, while sequences that are replicated are more likely to be true variants.

Dada2 requires a minimal of two observations of a sequence for it be even considered to be a true variant. We use the derepFastq function from DADA2 to dereplicate the forward and reverse reads.

drr <- derepFastq(FQF)
drr
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 59118 unique sequences
##   Sequence lengths: min=251, median=251, max=251
## $quals: Quality matrix dimension:  59118 251
##   Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences:  53754 5696 9443 25284 17014 ...
sum(table(drr$map)==1)
## [1] 56871
sum(table(drr$map)!=1)
## [1] 2247
drfut <- derepFastq(FQR)
drfut
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 68629 unique sequences
##   Sequence lengths: min=251, median=251, max=251
## $quals: Quality matrix dimension:  68629 251
##   Consensus quality scores: min=7, median=31, max=38
## $map: Map from reads to unique sequences:  38277 22800 68505 24182 30459 ...
sum(table(drfut$map)==1)
## [1] 67474
sum(table(drfut$map)!=1)
## [1] 1155

Out of the 74333 forward reads a total of 56871 (76%) are unique and only occur once! This is an issue for amplicon inference because we can only infer ASVs from duplicated reads.

5.4 Read trimming

Luckily we can remove some of the poor quality regions at the end of the reads reducing the number of unique sequences. The length of the V3V4 amplicons are between 400 and 430 bases, which means there is a 70-100 basepair overlap between the forward and reverse read. In order to merge the forward and reverse reads downstream, we need at least a 20 bp overlap. Thus the sum of the read length must be at least 450 bp.

What would be appropriate read trim length?

In contrast to the dada2 tutorial we do not filter on expected errors. We have found that sequence quality differs between amplicons and therefore quality filtering will skew the final composition.

length_trimmed_forward=240 # not optimal, check the error profiles for optimal trimming
length_trimmed_reverse=240 # not optimal, check the error profiles for optimal trimming
  
# filtered output files
FQFF <- "FQFF.gz"
FQRF <- "FQRF.gz"

out <- filterAndTrim(FQF, FQFF, FQR, FQRF,
                     truncLen=c(length_trimmed_forward,length_trimmed_reverse),
                     maxN=0, truncQ=0, rm.phix=F, compress=TRUE)
out
##                     reads.in reads.out
## L3_MOCK1.F.fastq.gz    74333     74333

Check the duplication rate after trimming.

drf <- derepFastq(FQFF)
drf
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 55655 unique sequences
##   Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension:  55655 240
##   Consensus quality scores: min=7, median=35, max=38
## $map: Map from reads to unique sequences:  50532 5878 9482 24071 16532 ...
sum(table(drf$map)==1)
## [1] 53128
drr <- derepFastq(FQRF)
drr
## derep-class: R object describing dereplicated sequencing reads
## $uniques: 74333 reads in 65323 unique sequences
##   Sequence lengths: min=240, median=240, max=240
## $quals: Quality matrix dimension:  65323 240
##   Consensus quality scores: min=7, median=32, max=38
## $map: Map from reads to unique sequences:  36597 234 65199 23704 29362 ...
sum(table(drr$map)==1)
## [1] 63749
What happened tot the expected error rate and read duplication?

Answer?

5.5 Calculation of the batch specific error

The second step in the process is the estimation of the batch specific error rates. Use of the quality score to test the validity of a sequence is what sets dada2 appart from the other popular ASV inference methods like Deblur and UNOISE3.

Error estimation is somewhat computational demanding and therefore pre-calculated error rates for this particular sequence run are supplied. (Otherwise error rates can be obtained using ?learnErrors). The following plots can be used to asses the error models and should show linear decrease in substitution rates over quality scores.

errF <- readRDS("./Raw_data/L3_dada2.errF.RDS")
plotErrors(errF, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis

errR <- readRDS("./Raw_data/L3_dada2.errR.RDS")
plotErrors(errR, nominalQ=TRUE)
## Warning: Transformation introduced infinite values in continuous y-axis

Points show the quality score dependent substitution rates while the red line shows the expected rates.

What do you observe regarding the expected vs the real error frequency?

6 ASV Inference

The third step in the dada2 workflow is the actual inference of the ASVs.

asv.f <- dada(derep = drf, err = errF)
## Sample 1 - 74333 reads in 55655 unique sequences.
asv.f
## dada-class: object describing DADA2 denoising results
## 217 sequence variants were inferred from 55655 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
asv.r <- dada(derep = drr, err = errR)
## Sample 1 - 74333 reads in 65323 unique sequences.
asv.r
## dada-class: object describing DADA2 denoising results
## 70 sequence variants were inferred from 65323 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
table(is.na(asv.f$map[drr$map]),is.na(asv.r$map[drf$map]))
##        
##         FALSE  TRUE
##   FALSE 55789  5068
##   TRUE  12451  1025
sum(is.na(asv.f$map[drr$map]) | is.na(asv.r$map[drf$map]))
## [1] 18544

DADA2 inferred 217 amplicon sequences from the forward reads and 70 from the reverse reads. Furthermore DADA2 did not find an suitable ASV representative for 18544 (25%) of the reads. NOTE:These statistics change when different filtering criteria are used

How do the forward and reverse read differ in performance? How could trimming differently improve these statistics?

6.1 read to ASV mapping

Old school OTU clustering methods rely on fixed sequence distances.

Lets have a look at the read to ASV mapping. Lets select an arbitrary forward ASV and trace the reads. Both the derep and dada-class objects have a mapping file. Lets first retrieve the reads from the fastq file.

asv.f$denoised[4]
## TGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCGACGCCGCGTGAAGGAAGAAGTATCTCGGTATGTAAACTTCTATCAGCAGGGAAGATAGTGACGGTACCTGACTAAGAAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTGTGGCAAGTCTGATGTGAAAGGCATGGGCTCAA 
##                                                                                                                                                                                                                                             2457
F_ASV_4.FQ <- Biostrings::readQualityScaledDNAStringSet(FQFF)[drf$map %in% which(asv.f$map %in% "4")]

In total 2457 reads map to this ASV.

When we align the reads with the ASV we can calculate the hamming distance, which should represent the true sequencing error.

read_asv_ham_dist <- nwhamming(as.character(F_ASV_4.FQ), asv.f$sequence[4])
summary(read_asv_ham_dist)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   2.000   3.949   5.000  33.000
barplot(table(read_asv_ham_dist)); abline(v = 2.4*3, col="red")

sum(read_asv_ham_dist>7.2)
## [1] 410

As we can see most of the reads have at least one error and roughly 15% of the reads have more than 3% error rate.

If we take this particular subset of sequences and would run a OTU clustering based approach, what would the results look like?

We can now also explore the expected error rate versus the true erro rate.

plot(apply(10^(-as.matrix(PhredQuality(F_ASV_4.FQ@quality))/10),1,sum), read_asv_ham_dist, 
     xlab="Expected Error", ylab="Read-ASV Hamming distance")

What do you observe in terms of expected error versus true error rate?

6.2 ASV pair merging

To get a single representative ASV sequence, forward and reverse ASV pairs are merged. Only those pairs that can be joined with at least a 20 bases overlap and without any mismatches are accepted.

asv.m <- mergePairs(dadaF = asv.f, derepF = drf, dadaR = asv.r, derepR = drr, returnRejects = T, maxMismatch = 0)
## Duplicate sequences in merged output.
DT::datatable(asv.m)

This table shows the statistics for all the asv pairs. Forward and reverse read mismatching is a common problem of Illumina sequencing. Therefore significant amount of reads are lost in this process.

How many ASVs do we have and how many reads do they represent?

length(asv.m$accept)
## [1] 3592
sum(asv.m$accept)
## [1] 372
sum(asv.m[asv.m$accept,]$abundance)
## [1] 52071
sum(asv.m[!asv.m$accept,]$abundance)
## [1] 13436

Out of 3592 possible combinations 372 valid merged ASVs are generated. A total of 13436 (15%) reads belong to read pairs which do not properly merge. In total 30 % of the reads have been lost during processing of the data.

6.3 Chimera removal

During PCR and bridge amplification potential chimeric sequences are generated. These hybrid sequences generate artificial links in the phylogenetic relation of the ASVs. Therefore the ASVs are screened for chimeras. V3V4 amplicons have a conserved region in the middle and are therefore especially prone to chimera formation

Fig 2. Chimeric sequences

DADA2 has a function isBimeraDenovo to detect these chimeric sequences.

seqtab <- makeSequenceTable(asv.m[asv.m$accept,])
bimeras <- isBimeraDenovo(seqtab)
asv.m$bimera <- bimeras[asv.m$sequence]
sum(bimeras)
## [1] 289
sum(na.omit(asv.m$abundance[asv.m$bimera]))
## [1] 8055

A total of 289 ASVs were identified as possible chimeric sequences, representing 8055 reads (15%). When we remove these we will have completed the ASV inference and we will get our final ASV abundances.

seqtab.nochim <- seqtab[,names(which(!bimeras))]
length(seqtab.nochim)
## [1] 83
sum(seqtab.nochim)
## [1] 44016

In the end a total of 83 valid ASVs were inferred representing 44016 reads (57%). This means we lost a total 43% of the reads while processing the data. These are quite common statistics.

How could these statistics be improved?

How could the loss of reads impact interpretation of the data?

7 Precision and recall

To determine the accuracy and precision of our data, we compare the valid ASVs to to the sequences of the reference strains. The reference sequences for each member have been obtained by sanger sequencing. Sanger sequencing does not allow to discriminate between isoforms and therefore the reference sequences contain some ambiguous basecalls. Thus we can expect some discrepancies between the ASV and the reference sequences.

7.1 Get the mock data

The reference sequences contain the entire 16S gene. In order to compare it to the data we can perform an in silico pcr to determine the expected ASV.

asv.valid.seqs <- names(seqtab.nochim)
# read in the sequences of the mock
mockref <- as.character(readDNAStringSet("./Raw_data/Mock_reference_NoN.fasta"))

insilicopcr <- list()
for(ref in names(mockref)){
  refseq <- mockref[ref]
  start <- vmatchPattern(pattern = "CCTACGGGAGGCAGCAG", subject = refseq,
                         max.mismatch=1, min.mismatch=0,
                         with.indels=FALSE, fixed=TRUE,
                         algorithm="auto")

  end <- vmatchPattern(pattern = "GGATTAGATACCCTTGTAGTC", subject = refseq,
                       max.mismatch=5, min.mismatch=0,
                       with.indels=FALSE, fixed=TRUE,
                       algorithm="auto")

  start.i <- endIndex(start)[[1]]+1
  end.i <- startIndex(end)[[1]]-1

  if (!(isEmpty(start.i) | isEmpty(end.i))){insilicopcr[[ref]] <- Biostrings::subseq(refseq, start = start.i, end = end.i)}
}

mock.comp <- data.frame(read.csv("./Raw_data/Mock.composition.txt", sep=",", header = T, row.names = 1))
mock.comp$ASV <- unname(unlist(insilicopcr))

mock.comp$ASV %in% names(seqtab.nochim)
##  [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [37] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [49]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
DT::datatable(mock.comp)

As we can see many of the MOCK in silico amplicons have identical sequences to the obtained ASVs.

Lets see how well they actually match by calculating the pairwise hamming distance between the sample and mock members.

hamming.list <- lapply(asv.valid.seqs, function(x) nwhamming(x, mock.comp$ASV))
hamming.distance.mat <- do.call(rbind, hamming.list)
colnames(hamming.distance.mat) <- rownames(mock.comp)
rownames(hamming.distance.mat) <- asv.valid.seqs
#hamming.distance.mat.long <- setNames(melt(hamming.distance.mat), c('asv', 'mock', 'dist'))
#Get the closest relative for each mock and asv
mockrefdist <- data.frame(dist = colMins(hamming.distance.mat),
                          name = names(mockref))

ggplot(mockrefdist, aes(x=name, y=dist)) + 
  geom_bar(stat="identity") +
  coord_flip() + 
  labs(y="Minimum hamming distance", title="Distance mock to representive asv (Recall)", x=NULL)

54 of the mock members have a representative ASV in the sample. Only a single species, MC_2 Micrococcus MC_2 does not have a good matching ASV. So we can say that we have a recall of 98% in detection.

7.2 Precision

Besides recall another import measure is precision. In this case how many of the ASVs are actually derived from the mock. Lets assume that all ASVS with a higher than five hamming distance to the mock members are false possitives.

asvrefdist <- data.frame(dist = rowMin(hamming.distance.mat),
                         name = paste0("ASV_",stringr::str_pad(as.character(1:length(asv.valid.seqs)), width=3, pad = "0")))

ggplot(asvrefdist, aes(x=name, y=dist)) + 
  geom_bar(stat="identity") +
  coord_flip() +
  ggplot2::geom_hline(yintercept = 5, col='red') + 
  labs(y="Minimum hamming distance", title="Distance asv to mock representive (Precision)", x=NULL)

# False positive ASVs
sum(asvrefdist$dist>=5)
## [1] 20
# False positive ASVs total reads
sum(seqtab.nochim[asvrefdist$dist>=6])
## [1] 1694

As we can see there were quiet some ASVs which are not likely to be derived from the mock members. In binary terms, ie presence or absence, we can say we have a precision of 68/80 which is 79%. Most of these are actually stealthy chimeras.

What would the precision be if we consider the reads rather than the ASVs?

8 ASV Annotation

Since ASV sequences by themselve are none informative we need to put them into context. We use phylogenetic trees to put the sequences in evolutionary context, while we use taxonomy in order to put them in scientific context.

8.1 Assigning taxonomy

Taxonomy is how we classify organisms into specific groups at certain ranks. Taxonomy tries to capture phylogeny (evolutionary relationships) at a coarse level.

Fig 3. Main taxonomic ranks

Fig 3. Main taxonomic ranks

Lets assign taxonomy to the inferred ASV sequences. DADA2 implements a variant of the (RDP) naive bayes classifier. In short kmer profiles from the ASV sequences are compared to those in a curated taxonomy reference database. In this case the reference database is Silva V132.

tax <- assignTaxonomy(seqs = asv.valid.seqs,"./Raw_data/silva_nr_v132_train_set.fa.gz")
# lets inlcude the Genus taxonomy in the name of the ASV
asvrefdist$name_genus <- paste(asvrefdist$name,tax[asv.valid.seqs,"Genus"])

8.2 Phylogenetic tree

A phylogenetic tree tries to capture the evolutionary relationships of various organims. We can use the sequence (dis)simmilarity of the ASVs to infer these phylogenetic relationsships. These trees help in generating context for our inferred sequences.

seqs <- c(asv.valid.seqs,mock.comp$ASV)
names(seqs) <- c(asvrefdist$name_genus,mockrefdist$name)
alignment <- AlignSeqs(DNAStringSet(seqs), anchor=NA)
## Determining distance matrix based on shared 8-mers:
## ================================================================================
## 
## Time difference of 0.16 secs
## 
## Clustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.03 secs
## 
## Aligning Sequences:
## ================================================================================
## 
## Time difference of 1.95 secs
## 
## Iteration 1 of 2:
## 
## Determining distance matrix based on alignment:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Reclustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.02 secs
## 
## Realigning Sequences:
## ================================================================================
## 
## Time difference of 0.59 secs
## 
## Iteration 2 of 2:
## 
## Determining distance matrix based on alignment:
## ================================================================================
## 
## Time difference of 0.01 secs
## 
## Reclustering into groups by similarity:
## ================================================================================
## 
## Time difference of 0.02 secs
## 
## Realigning Sequences:
## ================================================================================
## 
## Time difference of 0.16 secs
phang.align <- phyDat(as(alignment, "matrix"), type="DNA")
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm)

fit = pml(treeNJ, data=phang.align)
## negative edges length changed to 0!
fitGTR <- update(fit, k=4, inv=0.2)
#fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
#                      rearrangement = "stochastic", control = pml.control(trace = 0))
fitGTR$tree <- midpoint(fitGTR$tree)

Lets visualize the tree

groupInfo <- rep("Silva",length(fitGTR$tree$tip.label))
groupInfo[fitGTR$tree$tip.label %in% asvrefdist[asvrefdist$dist>=5,]$name_genus] <- "Xeno_ASV"
groupInfo[fitGTR$tree$tip.label %in% asvrefdist[asvrefdist$dist<5,]$name_genus] <- "Valid_ASV"
groupInfo[substr(fitGTR$tree$tip.label,1,3)=="MC_"] <- "MOCK"

groupInfo <- split(fitGTR$tree$tip.label, groupInfo)

fitGTR$tree <- groupOTU(fitGTR$tree, groupInfo)

test <- fitGTR$tree
attr(test, "Source") <- attributes(fitGTR$tree)$group


p <- ggtree(test, layout = "rectangular")+ geom_tiplab(size=3, aes(color=Source),key_glyph = "rect", ) + ggplot2::xlim(0, 0.3)
## Found more than one class "phylo" in cache; using the first, from namespace 'phyloseq'
## Also defined by 'tidytree'
p

We can see that most of the ASVs and mock sequences occur in pairs and their associated taxonomies are mostly congruent. The sequences that do not get classified at Genus level (NA) are all

8.3 Abundance correlation

Various steps in generating and processing of the data results in bias in the observed final composition. Since taxonomy is too coarse grained and potentially biased, lets cluster the ASVs based on the distance in the tree.

dd = as.dist(cophenetic.phylo(fitGTR$tree))
psclust = cutree(hclust(dd), h = 0.05)

I’ve aggregated the tree tips at a semi arbitrary distance of 0.05. How would the results change if I would use a much higher or lower cutoff?

Lets aggregate the ASV abundance and mock composition, based on the clustering of the tree. Once weve clustered the ASV and mock sequences we can compare the expected versus the observed relative abundances.

seqtab.nochim.relabu <- seqtab.nochim/sum(seqtab.nochim)

Abundances <- c(seqtab.nochim.relabu, mock.comp$MC3/100)
names(Abundances) <- names(seqs)

cluster_list <- split(names(psclust), psclust)
names(cluster_list) <- unlist(lapply(cluster_list, function(x) names(which.max(Abundances[x])))) # representative sequence

cl.abundance_sample <- aggregate(seqtab.nochim.relabu, by=list(psclust[c(asvrefdist$name_genus)]), FUN=sum)
cl.abundance_mock <- aggregate(mock.comp$MC3/100, by=list(psclust[c(mockrefdist$name)]), FUN=sum)

rownames(cl.abundance_sample) <- names(cluster_list)[cl.abundance_sample[,1]]
rownames(cl.abundance_mock) <- names(cluster_list)[cl.abundance_mock[,1]]

df <- data.frame(label=names(cluster_list),
                 sample=cl.abundance_sample[names(cluster_list),2],
                 mock=cl.abundance_mock[names(cluster_list),2]
                 )
df[is.na(df)] <- 0

ggplot(df, aes(x=sample, y=mock)) + 
  geom_point(size=3) + 
  ggrepel::geom_text_repel(aes(label=label)) + 
  geom_abline(slope = 1) + 
  #scale_x_log10() + 
  #scale_y_log10() +
  NULL
## Warning: ggrepel: 27 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

While we can observe some bias for specific groups like Akkermansia and Roseburia overall there is significant correlation betweem the observed and expected abundance.

How does the observed species specific bias impact the data? Note, we measured a fixed set of reads